96        Bioinformatics

• A file with the “*stats.csv” or “*stats.tab” extension contains the contig/scaffold
contiguity statistics in CSV or tab-separated format, respectively. The assembly
statistics generated by ABySS are shown in Figure 3.6 and described in Table 3.1.

The contig/scaffold N50 is the most widely used metric for describing the quality
of a genome assembly. It is calculated by first ordering every contig/scaffold by length
from the longest to the shortest. Next, the lengths are summed, starting from the longest
sequence, until the running sum reaches or exceeds one-half of the total length of all
sequences in the assembly. The N50 of an assembly is the length (bp) of the last, and
therefore shortest, contig/scaffold added to that sum, and the L50 is the number of
contigs/scaffolds added. When comparing assemblies, a longer N50 and a smaller L50
indicate a better (more contiguous) assembly.
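The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code ABySS itself uses; the function name nx_lx and the toy lengths are hypothetical and chosen only for this example.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of sequence lengths.

    fraction=0.5 gives (N50, L50); 0.25 and 0.75 give N25/L25 and N75/L75.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    # Order sequences from the longest to the shortest.
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)

    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first sequence that brings the running sum to the threshold
        # defines Nx (its length) and Lx (how many sequences were needed).
        if running >= threshold:
            return length, count

    # Fallback for floating-point edge cases: all sequences were needed.
    return ordered[-1], len(ordered)


# Toy example with made-up lengths (total = 300, half = 150).
toy_lengths = [100, 80, 60, 40, 20]
n50, l50 = nx_lx(toy_lengths)
print(n50, l50)  # 80 2
```

With these toy lengths, the running sum is 100 after the first sequence and 180 after the second, so the second sequence (80) is the first to reach the halfway mark of 150.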

In Figure 3.6, the scaffolds file (ecoli-scaffolds.fa) contains 836 sequences, of which
107 are at least 500 bp long. Among these, the shortest sequence is 584 bp and the longest
is 267,586 bp. The N50 is 112,320 bp and the L50 (the number of scaffolds that together
account for at least 50% of the assembly) is 15.
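The simpler columns of such a report (the sequence count, the count at or above 500 bp, and the minimum, maximum, and sum) can be recomputed from a FASTA file. The sketch below assumes, as the 584 bp minimum in Figure 3.6 suggests, that the minimum, maximum, and sum are taken only over sequences of at least 500 bp. The helper names fasta_lengths and summarize are hypothetical, and the demo records are made up.

```python
def fasta_lengths(lines):
    """Yield the length of each sequence from an iterable of FASTA lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


def summarize(lengths, min_len=500):
    """Count all sequences, then summarize those of at least min_len bp."""
    lengths = list(lengths)
    kept = [n for n in lengths if n >= min_len]  # assumption: cutoff is inclusive
    return {
        "n": len(lengths),
        f"n:{min_len}": len(kept),
        "min": min(kept) if kept else 0,
        "max": max(kept) if kept else 0,
        "sum": sum(kept),
    }


# Made-up records: 800 bp, 200 bp, and 600 bp (split over two lines).
demo = [">s1", "ACGT" * 200, ">s2", "AC" * 100, ">s3", "A" * 300, "C" * 300]
stats = summarize(fasta_lengths(demo))
print(stats)  # {'n': 3, 'n:500': 2, 'min': 600, 'max': 800, 'sum': 1400}
```

Because fasta_lengths accepts any iterable of lines, an open file handle for a file such as ecoli-scaffolds.fa can be passed in place of the demo list.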

Figure 3.7 shows a diagram explaining the major metrics of a genome assembly
(N25=55, N50=70, N75=75, L25=4, L50=6, and L75=7) and how they can be computed. In
the figure, there are eight contigs ranked from the smallest to the largest. The total number
of bases is 445 Mb (100%), and half of that is 222.5 Mb (50%).

You can display both the contigs and scaffolds files in a Linux terminal using the “less”
command as:

TABLE 3.1  Assembly Statistics

Column    Description
N         The total number of sequences in the FASTA file
n:500     The number of sequences that are at least 500 bp long
L50       The number of scaffolds that account for at least 50% of the assembly
LG50      The number of scaffolds that account for at least 50% of the genome length
NG50      The length of the shortest contig at 50% of the total genome length
Min       The size of the smallest sequence
N75       The length of the shortest contig at 75% of the total assembly length
N50       The length of the shortest contig at 50% of the total assembly length
N25       The length of the shortest contig at 25% of the total assembly length
E-size    The sum of the squares of the sequence sizes divided by the assembly size
Max       The size of the largest sequence
Sum       The sum of the sequence sizes
Name      The file name of the assembly
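The E-size row in Table 3.1 translates directly into a one-line formula: the sum of the squared sequence sizes divided by the sum of the sequence sizes. The short Python sketch below illustrates it; the function name e_size and the toy lengths are hypothetical.

```python
def e_size(lengths):
    """Sum of squared sequence sizes divided by the total assembly size."""
    total = sum(lengths)
    if total == 0:
        raise ValueError("assembly size is zero")
    return sum(n * n for n in lengths) / total


# Toy lengths: squares sum to 22000 and sizes sum to 300.
toy_lengths = [100, 80, 60, 40, 20]
print(round(e_size(toy_lengths), 2))  # 73.33
```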

FIGURE 3.6  Assembly statistics.